# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
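The problem that `nest_asyncio` works around can be reproduced with the standard library alone. The sketch below does not use SGLang at all; `inner` and `outer` are hypothetical names chosen for illustration. It shows that calling `asyncio.run` while an event loop is already running (as is the case inside IPython or Jupyter) raises a `RuntimeError`.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine you would like to run.
    return "done"


async def outer():
    # We are already inside a running event loop here, just like
    # code executed in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` first patches asyncio so that such nested calls are permitted.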

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  mehmet i don't know if I'm transgender now or not but I see my gender as female now because I have a smaller body. There are many transgender people and I'm not the only one. What does it feel like to be transgender now?

---

I want to ask if I feel like I’m physically and gendered differently now.

---

I've had sex with a man but it was past my period. What does that mean?

---

I'm not sure if I'm a man now or not. Should I tell someone?

---

Does it make sense to have sex with someone you are not sure of your gender with someone
Prompt: The president of the United States is
Generated text:  representing the United States in a foreign policy meeting. As he leaves the room, he bumps into a new acquaintance. The acquaintance, Mr. Lee, tells him that he just received a huge sum of money from a local business, which he wants to invest in some startup companies.

The president has some knowledge about the startup companies. After careful consi

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also home to the French Parliament and the French National Library. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a major tourist destination and a major economic center in Europe. The city is known for its cuisine, fashion, and art, and is home to many famous museums and galleries. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly together. It is a city that has played a significant role in the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely future trends in AI:

1. Increased automation and artificial intelligence: As AI technology continues to improve, we can expect to see more automation and artificial intelligence in our daily lives. This could include things like self-driving cars, robots in manufacturing, and even more advanced forms of AI that can perform tasks that were previously done by humans.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, we will need to be careful about how we use it.
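To make the idea of "overlap removal" concrete, here is a minimal sketch of how streamed chunks could be merged when consecutive chunks repeat some text. It is not necessarily how `stream_and_merge` is implemented: the helper names `trim_overlap_sketch` and `merge_chunks_sketch` are hypothetical, and the chunk list is made-up sample data rather than real engine output.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the largest candidate overlap first so the longest match wins.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks_sketch(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks whose boundaries overlap (" is" and " capital" repeat).
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks_sketch(chunks)
print(merged)
```

Checking the largest overlap first means a repeated phrase is removed whole instead of only its last character.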



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [occupation or field of work] in [your job description]. I have a passion for [the thing that excites you most] and I'm always looking to learn and grow. I've been a [number of years] in my field, and I enjoy [something that's been a highlight or a lesson from a past experience].
[Insert the rest of the introduction, including any relevant details or facts about you that you would like to include in the self-introduction]. As a [occupation or field of work], I'm always looking for new challenges and opportunities to learn and grow. Whether it

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city that houses the historic Eiffel Tower and is home to the Louvre Museum and Notre-Dame Cathedral. Paris is known for its diverse and multicultural population, including French, French-speaki
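One reason to prefer the asynchronous interface is that several requests can be awaited together with `asyncio.gather`. The sketch below illustrates the pattern without loading a model: `FakeEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out). Whether and how concurrent calls should be issued against the real engine is an assumption to verify against the SGLang documentation.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches):
    # Start every batch at once; gather returns results in input order.
    tasks = [engine.async_generate(batch) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeEngine(), batches))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

Because `asyncio.gather` preserves the order of its arguments, each result list can be zipped back to the batch that produced it.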

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a self-described [Your profession]. I am a dedicated [Your profession] who has been in the field for [Number of years] years. I enjoy [Your profession] through [Your passion or hobby]. I also have a love for [Your hobby or activity]. I'm always looking for new experiences and challenges to try. And I'm always ready to learn and grow. Looking forward to meeting you! Remember, the best way to make friends is to be yourself, and be happy! [Your name] [Your profession]
Your name is [Name], and I'm a [Your profession] who

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Concise factual statement: **Paris** is the **capital** of **France**.

To ensure the accuracy of the statement, you would need to include:

- The capital city's name: Paris

- The country's name: France

- The primary function of the capital city: **Evaluating** or **Examining** the well-being and governance of the entire country.

However, please note that "Paris" is a collective name used for the city, and "capital" is specifically referring to its status as the largest and most populous city in France. Therefore, a more accurate statement would

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising and exciting, with many possible trends shaping its development in the years to come. Here are some of the key trends that could shape the future of AI:

1. Increased automation and roboticization: AI is already being used in various industries, and its ability to automate and perform repetitive tasks will continue to grow. This could lead to significant job losses in certain industries, but it could also lead to new job creation as new AI technologies are developed and adopted.

2. Enhanced human-AI collaboration: As AI technology improves, it may become more capable of understanding human emotions and behavior, leading to more effective and empathetic interactions between humans and
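The streaming loop above prints each chunk with `end=""` so that the pieces join into continuous text. The consumption pattern itself can be tried without a model. In this sketch, `fake_stream` is a hypothetical async generator standing in for a streaming engine call, and `collect` is an illustrative helper; neither is an SGLang API.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical chunk source; stands in for a streaming engine call."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield text[start:start + chunk_size]


async def collect(stream):
    """Print chunks as they arrive and also return the full text."""
    parts = []
    async for chunk in stream:
        # end="" keeps chunks on one line; flush shows them immediately.
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation while still giving incremental output.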




```python
llm.shutdown()
```