# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
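
To see why this patch is needed, the sketch below uses only the standard library to reproduce the error that an unpatched `asyncio` raises when `asyncio.run` is called from code that is already running inside an event loop, which is typically the situation in a Jupyter kernel. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected unless asyncio has been patched.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested call succeeded"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, the nested call is expected to be allowed instead of raising.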

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Leo. I am a PhD student in environmental science at the University of Toronto. My interests include exploring the effects of climate change on ecosystems, evaluating the performance of national climate policies, and analyzing the environmental impact of building materials. My current research is funded by the University of Toronto and the U.S. Geological Survey, and has received funding from the Natural Sciences and Engineering Research Council of Canada and the Canadian Foundation for Innovation. My research has been recognized with the Sigma Xi Student Research Award, the Charles W. Dawson Award, and the Brock Award. During my time at the University of Toronto, I have published my research in Nature Communications,
Prompt: The president of the United States is
Generated text:  supposed to represent the interests of the United States. He is supposed to represent the people of the United States and not his own special interests. It is no exagg
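
When the results feed a later processing step, it can help to pair each prompt with its completion and serialize the pairs. The sketch below assumes only the call shape used above: a `generate(prompts, sampling_params)` callable that returns one dictionary with a `"text"` key per prompt. The names `collect_records` and `fake_generate` are hypothetical; `fake_generate` is a stand-in so the snippet runs without a model, and in practice you would pass `llm.generate` instead.

```python
import json
from typing import Callable, Dict, List


def collect_records(
    generate: Callable[[List[str], Dict], List[Dict]],
    prompts: List[str],
    sampling_params: Dict,
) -> List[Dict[str, str]]:
    """Run one batched call and pair each prompt with its generated text."""
    outputs = generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def fake_generate(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # Stand-in for llm.generate (assumption): one dict with a "text" key per prompt.
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


records = collect_records(
    fake_generate,
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)

# One JSON object per line, a common format for offline batch results.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```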

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic Eiffel Tower, Notre-Dame Cathedral, and vibrant cultural scene. It is also the birthplace of French literature, art, and music. Paris is a bustling metropolis with a rich history and a diverse population. The city is known for its fashion, cuisine, and wine, and is a popular tourist destination. It is the largest city in France and the second-largest city in the European Union. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is home to many famous museums, including the Louvre and the Musée d'Orsay. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely future trends in AI:

1. Increased automation: AI is likely to become more prevalent in many industries, including manufacturing, transportation, and healthcare. Automation will likely lead to increased efficiency and productivity, but it will also lead to job displacement for some workers.

2. Enhanced privacy and security: As AI becomes more advanced, there will be a greater emphasis on protecting user data and preventing cyber attacks. This will require more advanced privacy and security measures, such as encryption and biometric
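
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk begins with text that the merged output already ends with, the repeated prefix is dropped before appending. The sketch below illustrates that idea and is not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this example.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged)
```

A suffix/prefix heuristic like this one cannot distinguish an overlapping chunk from text that legitimately repeats (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.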



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character Name]. I'm a [occupation] who has always been [short answer about your occupation]. I'm really excited about my new job and can't wait to learn more about my new field of work. How can I help you today? [Character Name] will be looking for someone with a strong work ethic and a passion for [occupation]. They're looking for someone who is [short answer about their current job], who is [short answer about their current position], and who can provide [specific experience or skill]. [Character Name] is a[short answer about their current occupation]. I'm really looking forward to [short

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

I apologize, but there seems to be a misunderstanding. France is not a country but a continent, and the capital of a continent is called a "capital city" in Englis
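
Because `async_generate` is awaited, independent requests can also be issued concurrently from a single coroutine. The sketch below illustrates that pattern with `asyncio.gather`. `FakeEngine` and `run_groups` are hypothetical names; `FakeEngine` mimics only the call shape used above (a list of dictionaries with a `"text"` key), so the snippet runs without a model.

```python
import asyncio
from typing import Dict, List


class FakeEngine:
    """Stand-in for the engine (assumption): mimics only async_generate's shape."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_groups(
    engine, groups: List[List[str]], sampling_params: Dict
) -> List[List[Dict]]:
    # Schedule one request per group; gather returns results in input order.
    pending = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*pending)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_groups(FakeEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)

for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```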

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Job Title/Position] with over [Number of Years] years of experience in [Industry or field]. I thrive in [Your Profession], [Your Profession's Best Attributes or Habits]. I am a [Your Profession's Title or Grade] who has always been [Your Profession's Passion or Interest]. I am also a [Your Profession's Challenge or Goal]. If you're interested in learning more about me, feel free to ask me anything! Let's connect! [Your Name] [Your Phone Number] [Your Email Address] [Your LinkedIn Profile Link] [Your Twitter Profile Link]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. Paris is also renowned for its rich culture and cuisine, and is a popular tourist destination for many. Paris is a vibrant and diverse city with a rich history and influence on world culture. In addition, it is a major financial center and host of the French presidential election. The city is known for its music, fashion, and film industries, as well as its fine dining and art scene. Paris is a city of contrasts and a city of beauty, and it continues to grow



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to continue to evolve rapidly, with a number of potential trends shaping how the technology will be used and interact with the world.

One of the most significant trends in AI is the growing importance of data in decision-making. As AI becomes more advanced and capable of analyzing large amounts of data, it will become increasingly valuable in helping businesses make informed decisions about customer interactions, marketing strategies, and more. This will require the development of new algorithms and models that can interpret and utilize the data effectively.

Another trend is the increasing adoption of AI in healthcare. AI has the potential to revolutionize medical diagnosis and treatment, by analyzing medical images and data from
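
Streaming output is often both displayed as it arrives and kept for later use. The sketch below shows one way to do both with an `async for` loop. `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)` that yields a fixed list of chunks, so the snippet runs without a model; `collect` is likewise a name chosen for this example.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    # Stand-in for async_stream_and_merge(llm, prompt, sampling_params) (assumption).
    for chunk in [" Paris", ",", " the", " capital", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect(prompt: str) -> str:
    parts: List[str] = []
    async for chunk in fake_stream(prompt):
        print(chunk, end="", flush=True)  # show text as it arrives
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect("The capital of France is"))
```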




```python
llm.shutdown()
```
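
If generation code can raise, it may be worth making sure `shutdown()` still runs. The sketch below wraps any object that exposes a `shutdown()` method in a context manager. `shutting_down` and `FakeEngine` are hypothetical names introduced for illustration; `FakeEngine` only records whether `shutdown()` was called and stands in for the real engine.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield the engine and call its shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    # Stand-in (assumption): only records whether shutdown() was called.
    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
try:
    with shutting_down(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```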