# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
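
To see why the patch is needed, the following minimal sketch reproduces the underlying problem with the standard library only. It does not use SGLang or `nest_asyncio`; the `inner` and `outer` coroutines are illustrative names. Calling `asyncio.run` while a loop is already running, which is the situation inside IPython or Jupyter, raises a `RuntimeError`.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running at this point, as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from a running loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why it has to run before the engine is used in those environments.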

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alex and I am a medical doctor. I will be here in America for 3-4 months.
I'm afraid of the flu, but I'm also very sick. I'm going to be in America tomorrow. 

My doctor is having a flu shot and will also be going to see me tomorrow.

Is there a way to catch the flu at home? Or is it spread by others? Thank you!

---

**Update 1:**
I got to know that it is a bit risky to go to the doctor if you are sick and there are very few patients. 

I don't think it is necessary to go to the
Prompt: The president of the United States is
Generated text:  a standing officer of the military. ________
A. 正确
B. 错误
答案:

A

当井下发生大面积的火灾时，应采取的措施有：____
A. 除烟
B. 灭火
C. 抑制
D. 掩护
答案:

ABCD

推进“一带一路”建设、加强国际合作，要以( )为主题，以( )为动力，以____为保障，实现政策沟通、设施联通、贸易畅通、资金融通、民心相通。
A. 互学互鉴
B. 开放合作
Prompt: The capital of France is
Generated text:  [ ]

A. Paris  
B. Nice  
C. London  
D. Dublin

To determine the capital of France, let's consider the following points:

1. Paris is the capita
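
The `sampling_params` dictionary used in this section sets `temperature` and `top_p`. As a rough illustration of what these two knobs generally mean, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. This is not SGLang's implementation; `top_p_sample` and `toy_logits` are hypothetical names, and the logit values are made up for the example.

```python
import math
import random


def top_p_sample(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one token from a {token: logit} dict (temperature must be > 0)."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Keep the smallest set of most likely tokens whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up logits for the next token after "The capital of France is".
toy_logits = {" Paris": 2.0, " Lyon": 1.0, " Nice": 0.5, " London": -1.0}
picks = [top_p_sample(toy_logits, rng=random.Random(seed)) for seed in range(5)]
print(picks)
```

With these toy values, the least likely token falls outside the nucleus and is never sampled, while a smaller `top_p` narrows the choice further toward the most likely token.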

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm always looking for new ways to improve myself and expand my knowledge. What's your favorite hobby or activity? I enjoy [insert a short description of your favorite activity]. I'm always looking for new ways to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its cuisine, fashion, and art scene. Paris is a major tourist destination and a cultural hub for Europe. It is home to many world-renowned museums, theaters, and art galleries. The city is also known for its vibrant nightlife and its role in the French Revolution and the French Revolution. Paris is a city of contrasts, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing them to learn and adapt to new situations more effectively.

2. Enhanced machine learning capabilities: AI systems are likely to become even more capable of learning and making decisions on their own, with the ability to learn from experience and improve over time.

3. Greater reliance on data: AI systems will become more reliant on large amounts of data to learn and make decisions, with the ability to analyze and interpret data in ways that are difficult for humans to do.

4. Increased
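
The cell in this section merges streamed chunks "with overlap removal". The idea can be sketched in plain Python as follows. This is an illustration under assumptions, not the code behind `stream_and_merge`; the helper names `strip_repeated_prefix` and `merge_chunks` and the demo chunks are hypothetical.

```python
def strip_repeated_prefix(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that is not already the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
demo_chunks = [" Paris", " Paris, also", " also known as", " the City of Light."]
print(merge_chunks(demo_chunks))
```

A purely textual overlap check like this one can also drop a legitimately repeated character or word at a chunk boundary, so it is a heuristic rather than an exact reconstruction.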



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Job Title]. I've always been passionate about [One or Two Specific Areas of Interest]. I have a [Number] of years of experience in [Field of Expertise]. My [Skill or Profession] is... [Your Skill/Profession] and I am [Your Profession]. I'm always looking to learn new things and grow my skills. I enjoy [One or Two Interests/Activities]. I believe in [One or Two Core Values/Personality Traits]. I am [Your Character Name], and I am eager to share my experiences and interests with anyone interested. Thank you for considering

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
That's correct! Paris is the capital of France, located on the left bank of the Seine river in the south of the country. It is one of the most iconic cities in the world and a major center of the arts, culture, and comme
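
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below shows that pattern with `asyncio.gather`. It runs without SGLang: `StubEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of prompts in, a list of dictionaries with a `text` key out), and `heartbeat` represents unrelated work.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the awaitable call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to run inference
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(ticks: int):
    """Unrelated work that keeps running while generation is awaited."""
    seen = []
    for i in range(ticks):
        await asyncio.sleep(0.002)
        seen.append(i)
    return seen


async def run_concurrently():
    engine = StubEngine()
    prompts = ["The capital of France is", "The future of AI is"]
    # gather() schedules both awaitables and returns their results in order.
    outputs, ticks = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(3),
    )
    return prompts, outputs, ticks


prompts_out, outputs_out, ticks_out = asyncio.run(run_concurrently())
for prompt, output in zip(prompts_out, outputs_out):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

In a notebook, the final `asyncio.run` call needs the `nest_asyncio` patch described earlier, or it can be replaced with a top-level `await`.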

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Occupation or Skill] with an [Interest or Hobby] in [Main Interest]. I'm passionate about [Describe your main interest or hobby]. I love [What motivates you to pursue this interest or hobby]. I believe in [What you believe in, such as self-improvement, kindness, or overcoming challenges]. I am [Tell me about your personality type, such as introverted, extroverted, extroverted with introverted traits, etc.]. I enjoy [What you enjoy doing, such as reading, hiking, or cooking]. I am [Tell me about your physical


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is concise and to the point. It captures the most important details about Paris, including its capital status, its name, and its location. By stating "the capital of France" and "Paris, " we can summarize the main point of the statement in a single sentence. The statement is clear, easy to remember, and provides a quick overview of Paris's importance in French history and culture. It can also be used as a starting point for further research or discussion about the city. Overall, this is a good example of a simple yet effective factual statement about a city in France.

While "Paris" is


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and there are many possible trends that could shape the technology's direction. Here are some potential areas of development and change:

1. Increased use of AI in healthcare: With the increasing use of AI for disease diagnosis, drug discovery, and personalized medicine, we may see more widespread adoption of AI in healthcare in the coming years. This could lead to better diagnoses, faster drug development, and more personalized treatments.

2. Autonomous vehicles: Autonomous vehicles are already a reality, but there is still a lot of development and improvement that needs to be done before they become widespread. AI will play a major role in the development of autonomous vehicles
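
The cell in this section streams the prompts one after another. As a possible variation, the sketch below consumes several streams concurrently, with one task per prompt. It is a pattern illustration only and runs without SGLang: `stub_stream` is a hypothetical async generator standing in for a streaming call, and whether the real engine behaves well under this usage is an assumption that has to be checked against the library.

```python
import asyncio


async def stub_stream(prompt: str, delay: float = 0.001):
    """Hypothetical stand-in for a streaming call: yields text chunks one at a time."""
    for word in ["alpha", "beta", "gamma"]:
        await asyncio.sleep(delay)
        yield f" {word}"


async def collect(prompt: str) -> str:
    """Consume one stream to completion and return the joined text."""
    pieces = []
    async for chunk in stub_stream(prompt):
        pieces.append(chunk)
    return "".join(pieces)


async def stream_all(prompts):
    # One coroutine per prompt, so the streams interleave instead of running back to back.
    results = await asyncio.gather(*(collect(p) for p in prompts))
    return dict(zip(prompts, results))


streamed = asyncio.run(stream_all(["prompt A", "prompt B"]))
for prompt, text in streamed.items():
    print(f"{prompt}:{text}")
```

Because each stream is collected separately, the chunks of different prompts never mix in the output, at the cost of not printing tokens as they arrive.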




```python
llm.shutdown()
```
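
If an exception interrupts a script before `shutdown()` is reached, the engine may be left running. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is an assumption about how one might wrap the engine, not an SGLang API: `engine_session` is a hypothetical helper, and `RecordingEngine` is a stub used so the example runs without a model.

```python
from contextlib import contextmanager


class RecordingEngine:
    """Hypothetical stub that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine via factory() and always call its shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(RecordingEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.closed)
```

With the real engine, the factory could be a `lambda` that constructs `sgl.Engine` with the desired `model_path`.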