# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
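
As a rough illustration of the "custom server" idea, the sketch below shows only the request-handling core: it parses a JSON body, calls a `generate` method, and returns a status code with a JSON response. `StubEngine` and `handle_generate` are hypothetical names invented for this example, and the stub assumes that a single prompt yields a dict with a `"text"` key. The linked `custom_server.py` script remains the reference for how this is done with the real engine.

```python
import json


class StubEngine:
    """Hypothetical stand-in for an engine object; not part of SGLang."""

    def generate(self, prompt, sampling_params):
        # Echo the prompt so the example runs without a model or GPU.
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, raw_body: bytes):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-object payload.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, body = handle_generate(StubEngine(), b'{"prompt": "Hello"}')
print(status, body.decode())
```

Keeping the handler a plain function of `(engine, bytes)` means it can be plugged into whichever web framework you prefer and tested without starting a server.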



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
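
To see why a patch like this can be needed, the following standard-library sketch reproduces the underlying restriction: `asyncio.run` refuses to start when an event loop is already running in the same thread, which is the situation inside a notebook cell. The function names `inner` and `outer` are made up for the illustration, and `nest_asyncio` itself is deliberately not imported here.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.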

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Bob, and I'm 24 years old. I've been working as a software engineer since 2011, and I'm currently a software developer at a startup called JAF. As the manager, my role is to oversee the entire team and ensure that everyone is working efficiently and effectively. My role also includes communicating with the business development team to stay informed on our clients' needs and growth. I'm passionate about my work and aim to continually improve my skills and knowledge. I have a diverse background, with an IT background and a background in marketing, and I enjoy the challenge of working with a new team every day
Prompt: The president of the United States is
Generated text:  trying to decide whether to send his staff to Europe to work or to the United States. He has a certain number of people in his staff, with an equal number in each city. The president's secretary and the president are in different cities, and they are separated by a distance of 1
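
The `sampling_params` in this cell set `temperature` and `top_p`. As a reminder of what these two knobs conventionally mean, here is a small pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It is a textbook-style illustration with made-up logits and a hypothetical function name, not SGLang's sampler.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature rescales logits: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax, shifted by the max for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one of the kept tokens; choices() normalizes the weights itself.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -3.0]  # made-up scores for a 4-token vocabulary
print(sample_top_p(toy_logits, temperature=0.8, top_p=0.95, rng=random.Random(0)))
```

With these toy values the least likely token falls outside the 0.95 nucleus, so it can never be drawn. Lowering `top_p` or `temperature` narrows the choice further.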

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What do you do for a living? I'm always looking for new challenges and opportunities to learn and grow. What do you like to do in your free time? I enjoy reading, playing sports, and spending time with my family. What's your favorite hobby? I love [insert a short description of your favorite hobby].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant cultural scene. It is also home to the Louvre Museum, the most famous art museum in the world, and the Notre-Dame Cathedral, which is a UNESCO World Heritage Site. Paris is a bustling city with a rich history and a diverse population, making it a popular tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. The city is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Overall, Paris is a city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased automation and robotics: AI is already being used in manufacturing, healthcare, and transportation, and we can expect to see even more automation and robotics in the future. This will lead to increased efficiency, lower costs, and improved quality of life for humans.

2. Improved natural language processing: AI is already being used in chatbots, virtual assistants, and other forms of natural language processing. We can expect to see even more improvements in this area in the future, leading to more accurate and intuitive AI systems.

3. Enhanced privacy and security: AI systems are
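
The banner printed by this cell mentions "overlap removal". One plausible way to merge streamed chunks without duplicating text is sketched below, assuming each chunk is a dict with a `"text"` key and that a chunk may repeat the tail of what has already been received. The helper names and the fake stream are illustrative; the actual `stream_and_merge` in `sglang.utils` may work differently.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate streamed chunks into one string, skipping repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
fake_stream = [
    {"text": "The capital"},
    {"text": "capital of France"},
    {"text": "France is Paris"},
]
print(merge_stream(fake_stream))  # prints "The capital of France is Paris"
```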



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an experienced [field or occupation] with over [number of years] years of experience. I am highly skilled in [specific skill or area of expertise], and I have always been passionate about [reason for interest, such as travel, education, or spirituality]. I have a passion for [reason for interest, such as helping others, making people happy, or solving problems]. I am confident in my abilities and always strive to be the best I can be. I am always ready to learn and grow, and I am always willing to help others. I am an active member of the [professional group or organization

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. France's capital city, Paris, is known for its iconic landmarks such as
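
The cell above passes the whole prompt list to a single `async_generate` call. In a larger asyncio application you may instead want to issue several independent requests and await them together. The sketch below shows that pattern with `asyncio.gather`, using a hypothetical `StubAsyncEngine` so that it runs without a model. Whether separate calls or one batched call is preferable with the real engine is not covered here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in with an awaitable generate method; not SGLang's Engine."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return {"text": f" <completion for: {prompt!r}>"}


async def run_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt; gather runs them concurrently
    # and returns results in the same order as the inputs.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


demo_prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    run_concurrently(StubAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, out in zip(demo_prompts, results):
    print(prompt, "->", out["text"])
```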

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name], and I'm a [insert your occupation or profession] with a passion for [insert your hobby or interest]. I'm always up for a challenge and have a knack for turning problems into solutions. I'm a reliable, trustworthy person who's always ready to lend a helping hand whenever needed. I love to work hard and try my best to achieve my goals. I'm always trying to stay up to date with the latest trends and technology to stay ahead of the curve. I'm always eager to learn and improve my skills and knowledge. I'm a positive, proactive person who always tries to make a positive impact in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in Europe and one of the largest cities in the world, with a population of over 2 million people. Paris is known for its historical landmarks, cultural attractions, and vibrant street life. The city is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral, among others. Paris is a popular tourist destination and has played a significant role in French and European history. The city is also home to many important institutions such as the French Academy, the French Embassy, and the French Parliament. With its rich history and modern culture, Paris is a fascinating city to visit

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and exciting, with many possibilities for different applications and developments. Here are some of the possible trends that AI may follow in the future:

1. Increased automation: One of the biggest trends in AI is the increasing automation of human tasks. This could involve the creation of more advanced AI systems that can perform tasks such as data analysis, natural language processing, and image recognition.

2. Personalized AI: Another trend is the development of AI that can be more personalized. This could involve creating AI systems that learn from the data that is collected and adjust their behavior accordingly.

3. More accurate and detailed models: AI models are getting more accurate



In [6]:
llm.shutdown()
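
In a notebook the final `llm.shutdown()` cell runs on its own, but in a script an earlier exception would skip it. One common way to guard against that is a `try`/`finally` block, sketched here with a hypothetical `StubEngine` that only records whether `shutdown()` was called.

```python
class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params):
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": " ok"}

    def shutdown(self):
        self.is_shut_down = True


engine = StubEngine()
try:
    engine.generate("", {})  # fails on purpose
except ValueError as exc:
    print("generation failed:", exc)
finally:
    engine.shutdown()  # runs whether or not generation succeeded

print("shut down:", engine.is_shut_down)
```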