# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
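
The patch matters because `asyncio.run`, which the asynchronous examples in this document call, refuses to start when an event loop is already running, as is the case inside an IPython or Jupyter cell. The sketch below reproduces that failure using only the standard library; it does not touch SGLang, and the function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, nested calls of this kind are allowed to run instead of raising.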

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex. I'm an avid crossword solver, crossword puzzle master, and crossword puzzle solver. I have a passion for crossword puzzles and the challenge they offer. I love to solve crossword puzzles and find hidden messages in the word grid. I love to try and solve the puzzles and I enjoy trying to find hidden clues and solving word games. I enjoy playing crosswords and sharing my love of the game with others. I can solve crossword puzzles and challenge myself with them.
My interest in crossword puzzles is based on the fact that crossword puzzles are an engaging and intellectually stimulating activity. I also enjoy being creative and finding creative solutions to the puzzles. I
Prompt: The president of the United States is
Generated text:  a politician who is elected from a party list consisting of 4 Democrats and 4 Republicans. He can run for any one of the four seats in the Senate, but can only win in one of the seats. He wants to choose a seat to
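
For offline batch jobs, the results are usually written to disk rather than printed. The following is a minimal sketch, assuming (as the loop above suggests) that the engine returns one dictionary per prompt, in order, with the completion under the `"text"` key. The helper name `pair_results` and the JSONL layout are illustrative choices, not part of SGLang.

```python
import json
from typing import Dict, List


def pair_results(prompts: List[str], outputs: List[Dict[str, str]]) -> List[str]:
    """Return one JSON line per (prompt, completion) pair."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]


# Stand-in data shaped like the results above; a real run would pass the
# list returned by llm.generate(prompts, sampling_params) instead.
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

jsonl_lines = pair_results(demo_prompts, demo_outputs)
print("\n".join(jsonl_lines))
```

The length check guards against silently dropping results, which `zip` would otherwise do when the two lists differ in length.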

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What do you like to do for fun? I love [insert a short description of your favorite hobby or activity]. I'm always looking for new experiences and adventures to try. What's your favorite hobby or activity? I love [insert a short description of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant cultural scene. 

(Note: The statement should be a single, clear sentence that captures the essence of Paris's importance and cultural significance.) 

Please format your response as a JSON object with the following keys and values:
{
  "city": "Paris",
  "famous_attractions": ["Eiffel Tower", "Notre-Dame Cathedral", "Vivendi Center"],
  "cultural_significance": "Vibrant cultural scene"
} 

Note: The statement should be grammatically correct and include all necessary information. The format

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased integration with human intelligence: As AI becomes more integrated with human intelligence, it may become more capable of understanding and responding to human emotions and behaviors.

2. Enhanced privacy and security: As AI becomes more advanced, there will be a need to address privacy and security concerns. This may lead to new regulations and standards for AI development and use.

3. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced
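
The "overlap removal" mentioned in the cell above refers to merging streamed chunks without repeating text that has already been emitted. The sketch below illustrates one way to do this: drop the longest prefix of each new chunk that duplicates the end of the accumulated text. The names `strip_overlap` and `merge_chunks` are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Illustrative chunks whose boundaries overlap.
merged = merge_chunks(["The capital", " capital of", " of France"])
print(merged)  # The capital of France
```

One limitation of this approach is that a genuine repetition at a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be dropped.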



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [noun] at [company name]. My professional experience spans [x] years of experience in [mention specific job title] and my passions include [list one or more of your personal interests or hobbies], [list any unique personal qualities or strengths]. In my free time, I enjoy [mention two or three activities or hobbies that I enjoy] and I value open-mindedness, creativity, and a strong work ethic. I believe in [mention an idea or belief that is important to you or to the industry] and I strive to be a [state of being or profession] person. What's your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, also known as "La Neuf" (The Nine) is the capital and most populous city of France, and the third-most populous city in the European Union. Paris is located on the western bank of the Se
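
When requests arrive independently (for example, in a custom server), each prompt can be submitted as its own coroutine while capping the number in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. `StubEngine` is a hypothetical stand-in so the example runs without a GPU, and calling `async_generate` with a single prompt string is an assumption here; the cell above passes a list.

```python
import asyncio
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for the engine; returns canned completions."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict[str, str]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(
    engine, prompts: List[str], sampling_params: Dict, limit: int = 2
) -> List[Dict[str, str]]:
    """Issue one request per prompt, with at most `limit` in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def one(prompt: str) -> Dict[str, str]:
        async with gate:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in argument order, so they line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_all(StubEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
print([r["text"] for r in results])
```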

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [specific job title or role] at [company name]. My strongest skills include [insert skills here]. I'm always eager to learn and continue to improve, and I'm always looking for ways to contribute to the company's success. Whether it's through innovative ideas, strong communication, or a commitment to customer service excellence, I'm always looking to make a difference in the world. Thank you for considering my profile. 

Yourself

Sure, here is a short, neutral self-introduction for a fictional character:

Hi there! My name is [Your Name], and I'm a [specific

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a major French city located on the Seine river in the center of the country. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also one of the most famous tourist destinations in the world and hosts major events and cultural events throughout the year. The French capital is a cultural melting pot of various cultures, languages, and traditions, and continues to be one of the most significant cities in Europe. The city is home to a diverse range of attractions and cultural events, and continues to attract visitors from around the world. Paris has become

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several trends, including:

1. Increased autonomy: More advanced AI systems will be able to make decisions on their own, rather than following strict rules or instructions from humans.

2. Integration with natural language processing: AI systems will become more capable of understanding and generating natural language, allowing for more efficient and context-sensitive interactions with humans.

3. Enhanced creativity and innovation: AI will be able to generate new ideas and approaches to problems, leading to breakthroughs in fields such as medicine, engineering, and artificial intelligence.

4. Greater ethical concerns: As AI systems become more autonomous, there will be increased pressure to address ethical

In [6]:
llm.shutdown()
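
In a script, an exception raised before `shutdown()` is reached would leave the engine running. One way to guard against that is to wrap the engine in a context manager, sketched below. The `managed` helper and `StubEngine` are hypothetical names introduced for illustration; the only engine method this sketch relies on is `shutdown()`, as used in the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for the engine; records whether it was shut down."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed(engine):
        raise ValueError("simulated failure during inference")
except ValueError:
    pass

print(engine.closed)  # True
```

With a real engine, the `with` body would contain the `generate` calls, and the engine would be constructed just before entering the block.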