# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
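
To see why this patch matters, consider what plain `asyncio` does when `asyncio.run` is called while an event loop is already running, which is the situation inside IPython or Jupyter. The sketch below uses only the standard library; `inner` and `outer` are illustrative names and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted, which is why it should run before the engine is used in such environments.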

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Rami and my last name is El-Sayed. I am from Egypt. I used to be a teacher and my daughter's name is Rami Al-Sayed. She is 15 years old. She is a patient and very good student.
I am trying to persuade my daughter that after she graduates from college she should go to another school. The school I was told is called "Yusuf Dahab College". She has already taken one test to show that she is a good student. I want to know if it is possible to change this test and if I should not. How should I make this happen? I have
Prompt: The president of the United States is
Generated text:  running for a second term. He will have served in office for 26 years. For every year he serves as president, he will receive $100,000 in compensation per year. He is also receiving a bonus of $500,000 for his first term. How much compensation will he receive in total after 2 terms?
To determine the total compensation the president of the United States will receive after se

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name],

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a popular tourist destination known for its fashion, art, and cuisine. The city is also home to the French Riviera, a popular tourist destination for its warm weather and beaches. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage. It is a city of contrasts

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, with the potential to revolutionize the way we treat and diagnose diseases.

2. AI in manufacturing: AI is already being used in manufacturing to optimize production processes and improve quality control. As AI technology continues to improve, we
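
The example above labels its output "with overlap removal". One plausible reading is that successive streamed chunks may repeat text that has already been received, so the repeated part is dropped before chunks are joined. The sketch below illustrates that idea in plain Python. The names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that repeats the tail of what came before."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous one.
chunks = ["The capital", "capital of France", "France is Paris."]
print(merge_chunks(chunks))
```

A suffix-prefix match like this cannot tell an accidental repeat from an intentional one, so it is only appropriate when chunks are known to overlap.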



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age], [Name] who is [Current Job]. I am passionate about [What I do or love to do] because [What motivates me]. I have [number] years of experience in [Skill], and I am [ability]. I'm always [positive, determined, compassionate], and I enjoy [whatever makes me happy]. I believe in [why] because [why]. I hope to [what I hope to achieve], and I have [number] different passions that I'm always excited to explore. Thank you for asking!
[Name], how do you feel about the world?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The capital of France is Paris. That is the largest city in France, located on the Seine River and known for its rich history, art, and cultural attractions. It is also home to many important political, economic, and cultural institutions. The city has a population of
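
A practical reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a GPU it uses a hypothetical `StubEngine` that only mimics the `async_generate` call shape and the `output["text"]` result format seen above; `load_metadata` is likewise an invented placeholder for unrelated I/O.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def load_metadata():
    """Hypothetical unrelated I/O that can overlap with generation."""
    await asyncio.sleep(0.01)
    return {"source": "demo"}


async def run_concurrently():
    engine = StubEngine()
    prompts = ["The capital of France is", "The future of AI is"]
    # gather schedules both coroutines on the same event loop and waits for both.
    outputs, metadata = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        load_metadata(),
    )
    return outputs, metadata


outputs, metadata = asyncio.run(run_concurrently())
print(len(outputs), metadata)
```

With a real engine, the stub would be replaced by the `llm` object created earlier; the surrounding `gather` pattern stays the same.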

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane. I am a skilled writer with a deep love for storytelling and a passion for creating engaging narratives. I am confident and articulate, and I am always up for a challenge. My writing style is original and my work is always relevant and thought-provoking. I am passionate about sharing my ideas with the world and I am always looking for new opportunities to grow and learn. Thank you. Hello, my name is Jane. I am a skilled writer with a deep love for storytelling and a passion for creating engaging narratives. I am confident and articulate, and I am always up for a challenge. My writing style is original and my work


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the political, cultural, economic, and financial center of France. Its charming and lively architecture, rich history, and vibrant arts scene make it a popular destination for tourists and locals alike. The city is home to numerous museums, museums, and landmarks, such as the Louvre, Notre-Dame Cathedral, and the Champs-Élysées. It is also known for its gastronomy, wine, and cuisine. Paris is a fascinating city with a rich cultural heritage and a strong sense of identity. It is a city that has played an important role in French history, politics, and culture since the 1



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid progress and innovation, driven by advances in computing power, data storage, and artificial intelligence techniques. Some potential trends that may shape the AI landscape include:

1. Increased integration of AI with other technologies: With the integration of AI with other technologies such as quantum computing, biotechnology, and robotics, it is possible that we may see even more complex, intelligent systems emerge.

2. AI-driven healthcare advancements: AI-powered diagnostic tools and treatment plans may help improve the accuracy and efficiency of healthcare delivery, leading to better patient outcomes.

3. AI-driven transportation innovations: Autonomous vehicles and other forms of AI-driven transportation could




```python
llm.shutdown()
```