# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

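As a rough illustration of the custom-server idea, the sketch below wires a request handler to an engine-like object. `StubEngine` and `handle_generate` are hypothetical names invented for this example, and the stub only echoes its prompt; the linked `custom_server.py` script remains the reference for a real setup.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompt."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # pretend to do asynchronous work
        return {"text": f" (echo of: {prompt})"}


async def handle_generate(engine, request_body: str) -> str:
    """Parse a JSON request, call the engine, and return a JSON response."""
    request = json.loads(request_body)
    output = await engine.async_generate(
        request["prompt"], request.get("sampling_params", {})
    )
    return json.dumps({"text": output["text"]})


body = json.dumps(
    {"prompt": "The capital of France is", "sampling_params": {"temperature": 0}}
)
response = asyncio.run(handle_generate(StubEngine(), body))
print(response)
```

In a real server, a web framework would call a handler of this kind for each incoming request, with the stub replaced by a single shared engine instance.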


## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
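The reason for the patch can be reproduced with the standard library alone: `asyncio.run` refuses to start when an event loop is already running, which is the situation inside IPython and Jupyter. The sketch below uses only `asyncio`; the function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested run is rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

After `nest_asyncio.apply()`, nested calls of this kind are permitted, which is why the patch is recommended for notebook use.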

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Daniel, and I'm a 17-year-old college student at a local university. I'm currently studying English literature and have been reading extensively about the history of literature and literary analysis.

Could you please write an essay on the topic of "literary devices" and explain them in detail with examples. Additionally, please include the literary devices used in a passage from Shakespeare's play "Hamlet". The essay should be 1500 words long and should incorporate MLA format for citations. As an AI, I'm not programmed to write essays, but I'd be happy to assist you with any other writing related to literature, history
Prompt: The president of the United States is
Generated text:  elected by the members of the legislature, and the president-elect is sworn in by the Speaker of the House. These actions are conducted under the law of the United States. There are also the events of the American revolution. Which of the following is a statement th

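For prompt lists too large to hold comfortably in one call, one option is to feed the engine in fixed-size chunks and collect prompt/completion records as you go. The sketch below assumes, as the loop above suggests, that the generate call returns one dict with a `"text"` key per prompt, in order. `fake_generate`, `chunked`, and `run_in_chunks` are hypothetical helpers for illustration, not part of SGLang.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Stand-in for a batch generate call; returns one dict per prompt."""
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


def run_in_chunks(
    generate: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
    chunk_size: int,
) -> List[Dict]:
    """Call `generate` chunk by chunk and pair each prompt with its text."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        outputs = generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
records = run_in_chunks(fake_generate, prompts, {"temperature": 0.8}, chunk_size=3)
print(len(records))
```

With a real engine, the bound method `llm.generate` could be passed in place of `fake_generate`, provided its return shape matches the assumption above.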
### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant nightlife. It is also a major center for art, music, and culture. Paris is a city of contrasts, with its historical landmarks and modern architecture blending seamlessly. It is a UNESCO World Heritage site and a major tourist destination. The city is home to many museums, theaters, and museums, including the Louvre and the Musée d'Orsay. Paris is a city of people, with its diverse population and vibrant culture. It is a city of history, with its rich history and cultural heritage. The city is also a city of innovation

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations more effectively.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to be used in even more areas, including diagnosis, treatment, and


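The phrase "overlap removal" can be read as follows: if a streamed chunk begins with text that already ends the accumulated output, the repeated prefix is dropped before appending. The sketch below is one minimal interpretation of that idea, not the actual implementation of `stream_and_merge`; the helper names `trim_overlap` and `merge_chunks` are assumptions made for this example.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris, known", " known for its"])
print(repr(merged))
```

A suffix/prefix match of this kind cannot tell a resent fragment from a genuine repetition in the model output, so it only suits streams whose chunks may overlap.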

### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an experienced [Title]. I have a passion for [Describe your favorite hobby or activity]. What's your area of expertise? I enjoy [Name], so [Name] has helped me solve some of [Name's] most difficult problems. I'm a problem solver and team player. Can you tell me more about your background and how you got started in this field? Let me know if you'd like me to share any additional information. Thank you. **[Your Name]**  
**[Your Title]**  
**[Your Area of Expertise]**  
[Your Contact Information]  
[Your Resume,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Île-de-France region of the country.

The statement is: **Paris, located in the **Île-de-France region of France**, is the capital of the country.** Paris is known for its rich history, art, and cuisine, and its status 

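A practical benefit of the asynchronous interface is that several requests can be in flight at once. The sketch below shows the general `asyncio.gather` pattern with a stand-in coroutine; `fake_async_generate` is hypothetical and only imitates the `{"text": ...}` output shape used above.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Stand-in for an asynchronous generate call on a single prompt."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, sampling_params):
    """Issue one request per prompt and wait for all of them together."""
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_all(prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because `gather` preserves input order, pairing prompts with outputs through `zip` stays valid even when individual requests finish at different times.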
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name here]. I'm a [insert your profession or title here], with a passion for [insert something related to your profession or hobbies here]. I'm always looking for new challenges and adventures to take me on, whether it's through books, movies, or online gaming. I'm always eager to learn and grow, and I'm constantly seeking out new experiences to keep me up-to-date with the world around me. My favorite hobbies are [insert a hobby here]. I enjoy [insert something related to my hobbies here]. I'm passionate about [insert something related to my hobbies here]. I hope to one day become a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is factually correct and provides a clear understanding of the capital city's name and location. If additional context is needed, it would be helpful to provide more information about the specific facts or details about Paris. However, based on the information given, the statement is accurate and comprehensive.

If you would like to learn more about Paris or any other topic, feel free to ask, and I'll do my best to provide a helpful response. Let me know if there's anything else I can assist with.

Remember, when providing factual statements, it's important to use clear and concise language, and make sure to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to continue to evolve in a number of ways, driven by advances in technology and in how we use and interact with AI systems. Some potential future trends in AI include:

1. Increased integration of AI into everyday life: As AI becomes more widely adopted, it's likely that we'll see even more seamless integration into our daily lives, from smart home technology to self-driving cars.

2. Greater emphasis on ethical AI: As AI systems become more complex and capable, it's likely that we'll see more focus on developing ethical guidelines and standards for how AI is used.

3. Advancements in AI for healthcare: AI is already being

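When the full text is needed in addition to live printing, the chunks from an asynchronous stream can be collected and joined once the stream ends. The sketch below uses a hypothetical async generator, `fake_stream`, in place of the engine; only the `async for` consumption pattern is the point.

```python
import asyncio


async def fake_stream(prompt):
    """Stand-in async generator that yields text pieces one at a time."""
    for piece in [" Paris", ".", " It", " is", " the", " capital", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(stream):
    """Print each piece as it arrives and return the joined text."""
    parts = []
    async for piece in stream:
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


text = asyncio.run(collect(fake_stream("The capital of France is")))
```

Collecting into a list and joining once avoids repeated string concatenation on long generations.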
In [6]:
llm.shutdown()